Kernel Assisted Collective Intra-node Communication Among Multicore and Manycore CPUs
Abstract
Even with advances in materials science, fundamental limits in heat and power distribution are preventing higher CPU clock frequencies. Industry solutions for increasing computation speed have instead concentrated on raising the number of computational cores, leading to the widespread adoption of so-called "fat" nodes. However, keeping all the computational cores busy doing useful work is a challenge, because typical high performance computing (HPC) workloads read and write a steady stream of data from memory, so contention for memory bandwidth becomes a bottleneck. Many commodity platforms have therefore embraced non-uniform memory access (NUMA) architectures that split up and distribute memory to be close to the cores. High-performance Message Passing Interface (MPI) implementations must exploit these architectures to provide reliable performance portability. NUMA architectures not only require specialized MPI point-to-point messaging protocols, they also require carefully designed and tuned algorithms for MPI collective operations. Multiple issues must be taken into account: 1) minimizing the number of copies required, 2) minimizing traffic to "remote" NUMA memory, and 3) carefully avoiding memory bottlenecks for "rooted" collective operations. In this paper, we present a kernel-assisted intra-node collective module addressing those three issues on many-core systems. A kernel-level inter-process memory copy module, called KNEM, is used by a novel Open MPI collective module to implement several improved strategies based on decreasing the number of intermediate memory copies and improving locality, reducing both the pressure on the memory banks and the cache pollution. The collective topology is mapped onto the NUMA topology to minimize cross traffic on inter-socket links.
Experiments illustrate that the KNEM-enabled Open MPI collective module can achieve up to a threefold speedup on synthetic benchmarks, resulting in a 12% improvement for a parallel graph shortest-path discovery application.

Keywords: MPI, multicore, shared memory, NUMA, kernel, collective communication
Similar resources
Locality and Topology Aware Intra-node Communication among Multicore CPUs
A major trend in HPC is the escalation toward manycore, where systems are composed of shared memory nodes featuring numerous processing units. Unfortunately, with scale comes complexity, here in the form of non-uniform memory accesses and cache hierarchies. For most HPC applications, harnessing the power of multicores is hindered by the topology oblivious tuning of the MPI library. In this pape...
Kernel-assisted and topology-aware MPI collective communications on multicore/many-core platforms
Multicore clusters, which have become the most prominent form of High Performance Computing (HPC) systems, challenge the performance of MPI applications with non-uniform memory accesses and shared cache hierarchies. Recent advances in MPI collective communications have alleviated the performance issue exposed by deep memory hierarchies by carefully considering the mapping between the collective...
Parallel 3D fast wavelet transform on manycore GPUs and multicore CPUs
GPUs have recently attracted attention as accelerators for a wide variety of algorithms, including assorted examples within the image analysis field. Among them, wavelets are gaining popularity as solid tools for data mining and video compression, though this comes at the expense of a high computational cost. After proving the effectiveness of the GPU for accelerating the 2D Fast Wavelet Tra...
Design and Optimization of Scientific Applications for Highly Heterogeneous and Hierarchical HPC Platforms Using Functional Computation Performance Models
HPC platforms are getting increasingly heterogeneous and hierarchical. The main source of heterogeneity in many individual computing nodes is due to the utilization of specialized accelerators such as GPUs alongside general purpose CPUs. Heterogeneous many-core processors will be another source of intra-node heterogeneity in the near future. As modern HPC clusters become more heterogeneous, due...
PATUS: A Code Generation and Auto-Tuning Framework for Parallel Stencil Computations
PATUS is a code generation and auto-tuning framework for stencil computations targeted at modern multi- and many-core processors, such as multicore CPUs and graphics processing units. Its ultimate goal is to provide a means toward productivity and performance on current and future multi- and many-core platforms. The framework generates the code for a compute kernel from a specification of the st...
Publication date: 2010